AD-A104 174

UNCLASSIFIED

CARNEGIE-MELLON UNIV PITTSBURGH PA DEPT OF STATISTICS  F/G 12/1

ASSESSING PROBABILITY ASSESSORS: CALIBRATION AND REFINEMENT. (U)

JUL 81  M H DEGROOT, S E FIENBERG  N00014-80-C-0637

TR-205  NL


DEPARTMENT 

OF 

STATISTICS 



Carnegie-Mellon University

PITTSBURGH,  PENNSYLVANIA  15213 


Assessing  Probability  Assessors: 
Calibration  and  Refinement 

by 

Morris  H.  DeGroot 
and 

Stephen E. Fienberg


Department  of  Statistics 
Carnegie-Mellon  University 
Pittsburgh,  PA  15213 

Technical Report No. 205

May, 1981
Revised July, 1981


The  preparation  of  this  paper  was  partially  supported  by  the  National 
Science  Foundation  under  grant  SES-7906386  and  by  the  Office  of  Naval 
Research Contract N00014-80-C-0637 at Carnegie-Mellon University.
Reproduction  in  whole  or  in  part  is  permitted  for  any  purpose  of  the 
United  States  Government. 




1.  Introduction 


You have just been hired by the management of a local television station
to assist them in evaluating the candidates for a soon-to-be-filled position
as station weatherman. Each of the candidates has made a sequence of
probability forecasts of the event "rain", announcing the probability $p_j$
on the $j$th trial of the sequence. Before making the next forecast the
candidate learns the value of $y_j$, which is 1 if "rain" occurs, and is 0
otherwise. The basic data available to you for each candidate is a set of
pairs $\{(p_j, y_j): j = 1, 2, \ldots, n\}$, and from this information you are
to assess the candidates, and possibly determine which is the best
probability assessor.

The  purpose  of  this  paper  is  to  provide  a  probabilistic  framework  into  which 
to  set  this  problem  of  assessing  probability  assessors. 

In the weatherman problem, we have taken care to ensure that each forecast
is made in light of full information of the outcome of previous forecasts,
i.e., with feedback. From a subjective probability perspective the announced
probability forecasts form a sequence of conditional probabilities in which
each term expresses the candidate's degree of belief given all of the
information available at the time of the forecast. The probability
distribution of these conditional probabilities, found by letting the number
of trials $n \to \infty$, is of central concern in this paper.

The notion of calibration concerns the relationship between the probability
distribution of conditional probabilities and the long-run frequencies of rain
given a particular probabilistic assessment value. Roughly speaking, a
probability assessor is said to be well-calibrated if, for those trials on
which he forecasts the probability $x$, the long-run frequency of rain is $x$.
Pratt (1962) and Dawid (1981) show that a probability assessor who is coherent
in the sense of de Finetti (1937) must be well-calibrated almost surely. In
Section 2, we make more formal this notion of calibration, and, in Section 3,
we show that some well-calibrated forecasters are clearly superior to others.
We suggest a formal sense in which a given well-calibrated forecaster can be
"more refined" and thus "better" than another. Then in Section 4, we
demonstrate the link between the concept of refinement and that of sufficiency
in the comparison of experiments. This link leads in Section 5 to a rather
simple condition for determining whether one well-calibrated forecaster is
more refined than another. In Section 6, this condition is exploited in order
to determine a "least-refined" forecaster.

Calibration and refinement, as presented in this paper, refer only to
the full probability distribution of the assessor's conditional forecasts.
However, in the television station example which began this section, and
elsewhere in statistical practice, we do not know either this distribution
or the long-run frequencies of rain. All that we get to see is a finite
set of forecasts and the associated indicators of whether or not rain occurred,
i.e., $\{(p_j, y_j): j = 1, 2, \ldots, n\}$. In Section 7, we briefly review
some scoring rules suggested for such sample situations, and indicate how they
relate to the probabilistic concepts of calibration and refinement.

For the forecasting problems considered in the first six sections of
the paper there are only two possible outcomes, rain or no rain. We take
care in these sections to preserve the orientation of outcomes and work
only with the forecasters' assessments of the probability of rain. Kadane
and Lichtenstein (1981) show that the loss of orientation leads to the
inability to recalibrate a forecaster's assessments. Finally, in Section 8
we discuss extensions of the calibration and refinement structure to
forecasting problems with $s > 2$ outcomes. In these problems, we require the
ordered vector of assessed probabilities for the $s$ possible outcomes and the
associated indicator vector which summarizes which outcome actually occurs.

2.  Well-calibrated  forecasters 

Consider  a  weather  forecaster  who  day  after  day  must  specify  his  subjective 
probability  x  that  there  will  be  at  least  a  certain  amount  of  rain  at  some 
given  location.  For  simplicity,  we  shall  refer  to  the  occurrence  of  this 
well-specified  event  as  "rain."  Thus,  we  may  say  simply  that,  at  the  beginning 
of  each  day,  the  forecaster  must  specify  his  probability  of  rain,  and  that  at 
the  end  of  each  day  it  is  observed  whether  or  not  rain  actually  did  occur. 

We shall refer to the probability $x$ specified by the forecaster on
any particular day as his prediction. Both for realism and simplicity, we
assume that the prediction $x$ is restricted to a finite set of values
$0 = x_0 < x_1 < \cdots < x_k = 1$. (In many weather forecasts, $k = 10$ and
$x_j = j/10$.) We assume that the forecaster's predictions can be observed
over a large number of days, and we shall let $\nu(x)$ denote the probability
function (or frequency function) of his predictions over those days. Thus, we
can think of $\nu(x)$ either as the probability that the forecaster's
prediction on a randomly chosen day will be $x$, or in the frequency sense as
the proportion of days on which his prediction is $x$. We shall let
$\mathcal{X}$ denote the set of possible predictions
$\{x_0, x_1, \ldots, x_k\}$ and let $\mathcal{X}^+$ denote the subset of
$\mathcal{X}$ containing only those points for which $\nu(x) > 0$.

To evaluate the forecaster, we must compare the actual occurrences of
rain or no rain with his predictions, and for $x \in \mathcal{X}^+$ we shall
let $\rho(x)$ denote the conditional probability of rain given that the
prediction is $x$. The forecaster is said to be well-calibrated (see, e.g.,
Dawid, 1981) if $\rho(x) = x$ for all values of $x \in \mathcal{X}^+$. In
words, the forecaster is well-calibrated if among all those days for which the
prediction is $x$, the proportion of rainy days is also $x$, and this is true
for every value of $x$. In meteorology, the criterion of calibration is
referred to as validity (Miller, 1962), or reliability (Murphy, 1973), and the
well-calibrated forecaster is said to be perfectly reliable.

For obvious reasons, being well-calibrated is usually regarded as a
desirable characteristic of a forecaster. It has been pointed out elsewhere
(DeGroot, 1979), however, that it is typically easy for any forecaster to
make himself well-calibrated by specifying predictions that do not represent
his subjective probabilities and in which he does not believe. Furthermore,
as Dawid (1980) has stated, even if the forecaster's true probabilities make
him well-calibrated, "this does not necessarily mean that they are 'accurate'
in all respects; and even if they are accurate, they may not be of much
substantive value if the forecaster is a poor meteorologist." Thus, a
well-calibrated forecaster is not necessarily a good forecaster, and we shall
now consider the problem of comparing well-calibrated forecasters.


3.  Refinement 

In this section, we shall restrict attention to well-calibrated
forecasters. Let $\mu$ denote the relative frequency of days on which it
rains. In meteorology, $\mu$ is sometimes called the climatological
probability. For any well-calibrated forecaster, it must be true that

$$\sum_{x \in \mathcal{X}} x\,\nu(x) = \mu . \tag{3.1}$$

Throughout this paper we shall assume that $0 < \mu < 1$.



In order to emphasize the possible differences that can exist among
such forecasters, we shall begin by considering two extreme types. Suppose
that $\mu \in \mathcal{X}$. Then the forecaster $A_0$ whose prediction each
day is $\mu$ will be well-calibrated, although his predictions are completely
useless for any purpose whatsoever. The predictions of $A_0$ are characterized
by the degenerate probability function

$$\nu_{A_0}(\mu) = 1, \qquad \nu_{A_0}(x) = 0 \ \text{ for } x \neq \mu . \tag{3.2}$$

When $\mu \in \mathcal{X}$, we shall refer to $A_0$ as the least-refined
forecaster.

Next, consider a well-calibrated forecaster $A^0$ whose predictions are
characterized by the following probability function:

$$\nu_{A^0}(1) = \mu, \qquad \nu_{A^0}(0) = 1 - \mu, \qquad
\nu_{A^0}(x) = 0 \ \text{ for } x \neq 0, 1 . \tag{3.3}$$

It can be seen from (3.3) that the only probabilities of rain that forecaster
$A^0$ ever specifies are 0 and 1, and since $A^0$ is well-calibrated, his
predictions are always correct. We shall refer to $A^0$ as the most-refined
forecaster. In meteorology, $A_0$ is said to exhibit zero sharpness, and $A^0$
to exhibit perfect sharpness (Sanders, 1963).

It is clear from $A_0$ and $A^0$ that quite different types of behavior
are possible among well-calibrated forecasters, ranging from useless to
perfect predictions. We shall now describe a concept that yields a partial
ordering on the class of all well-calibrated forecasters and justifies our
referring to $A_0$ and $A^0$ as the least and the most refined members of this
class.

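A stochastic transformation $h(x|y)$ is a function defined on
$\mathcal{X} \times \mathcal{X}$ such that

$$h(x|y) \geq 0 \ \text{ for } x \in \mathcal{X} \text{ and } y \in \mathcal{X},
\qquad
\sum_{x \in \mathcal{X}} h(x|y) = 1 \ \text{ for } y \in \mathcal{X} . \tag{3.4}$$
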

Now consider two arbitrary well-calibrated forecasters whose predictions are
characterized by the probability functions $\nu_A(x)$ and $\nu_B(x)$. We say
that $A$ is at least as refined as $B$ if there exists a stochastic
transformation $h$ such that the following relations are satisfied:

$$\sum_{y \in \mathcal{X}} h(x|y)\,\nu_A(y) = \nu_B(x)
\ \text{ for } x \in \mathcal{X}, \tag{3.5}$$

$$\sum_{y \in \mathcal{X}} h(x|y)\,y\,\nu_A(y) = x\,\nu_B(x)
\ \text{ for } x \in \mathcal{X}. \tag{3.6}$$

By subtracting (3.6) from (3.5) we get

$$\sum_{y \in \mathcal{X}} h(x|y)\,(1-y)\,\nu_A(y) = (1-x)\,\nu_B(x)
\ \text{ for } x \in \mathcal{X}, \tag{3.7}$$

which adds a touch of symmetry when (3.7) is paired with (3.6).

Together, the relations (3.5) and (3.6), or (3.6) and (3.7), state that
if we know the predictions of forecaster $A$, then we can simulate the
predictions of forecaster $B$ by using an auxiliary randomization based on the
stochastic transformation $h$ as follows: If $A$ makes the prediction
$y$ on a particular day, then we simulate a prediction $x$ in accordance
with the conditional probability distribution $h(x|y)$. The prediction
$x$ will then have exactly the same probabilistic properties as the
predictions of forecaster $B$. The relation (3.5) guarantees that we will
obtain each prediction $x$ with the same frequency $\nu_B(x)$ that $B$ does,
and the relation (3.6) guarantees that our predictions will still be
well-calibrated.
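
The relations (3.4), (3.5) and (3.6) can be checked mechanically for any
candidate transformation. The following is a minimal sketch, not part of the
original report: the function name `satisfies_refinement`, the dictionary
representation of $\nu$ and $h$, and the tolerance are all illustrative
assumptions.

```python
def satisfies_refinement(xs, h, nu_a, nu_b, tol=1e-9):
    """Check (3.4)-(3.6) for a candidate stochastic transformation.

    xs   : list of permissible predictions (the set X)
    h    : dict mapping (x, y) -> h(x|y); missing entries count as 0
    nu_a : dict mapping y -> nu_A(y); missing entries count as 0
    nu_b : dict mapping x -> nu_B(x); missing entries count as 0
    """
    # (3.4): each column h(.|y) must be a probability function on X.
    for y in xs:
        column = [h.get((x, y), 0.0) for x in xs]
        if min(column) < -tol or abs(sum(column) - 1.0) > tol:
            return False
    for x in xs:
        # Left-hand sides of (3.5) and (3.6).
        lhs_35 = sum(h.get((x, y), 0.0) * nu_a.get(y, 0.0) for y in xs)
        lhs_36 = sum(h.get((x, y), 0.0) * y * nu_a.get(y, 0.0) for y in xs)
        if abs(lhs_35 - nu_b.get(x, 0.0)) > tol:
            return False
        if abs(lhs_36 - x * nu_b.get(x, 0.0)) > tol:
            return False
    return True


# Illustrative use: a well-calibrated A with mu = 0.5, and the forecaster
# who always predicts mu. Mapping every prediction of A to 0.5 works.
xs = [0.0, 0.5, 1.0]
nu_a = {0.0: 0.1, 0.5: 0.8, 1.0: 0.1}
nu_const = {0.5: 1.0}
collapse = {(0.5, y): 1.0 for y in xs}
ok = satisfies_refinement(xs, collapse, nu_a, nu_const)
```

A check of this kind only verifies a given $h$; it does not search for one.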

To see that any forecaster $A$ is at least as refined as the least-refined
forecaster $A_0$, let us define the stochastic transformation $h$ as follows:

$$h(\mu|y) = 1 \ \text{ for } y \in \mathcal{X}, \qquad
h(x|y) = 0 \ \text{ for } x \neq \mu . \tag{3.8}$$

Then it follows from (3.1) that (3.5) and (3.6) are satisfied when $\nu_B$ is
replaced by $\nu_{A_0}$ as defined by (3.2).



Similarly, to see that the most refined forecaster $A^0$ is at least as
refined as any other forecaster $B$, let us define the stochastic
transformation $h$ as follows:

$$h(x|1) = \frac{1}{\mu}\,x\,\nu_B(x) \ \text{ for } x \in \mathcal{X}, \qquad
h(x|0) = \frac{1}{1-\mu}\,(1-x)\,\nu_B(x) \ \text{ for } x \in \mathcal{X}. \tag{3.9}$$

Since $B$ is well-calibrated, it follows from (3.1) that the function $h$
defined in (3.9) has the properties required of a stochastic transformation.
The definition of $h(x|y)$ for $y \neq 0, 1$ is irrelevant since forecaster
$A^0$ never makes a prediction other than 0 or 1. The relations (3.5) and
(3.6) will now be satisfied when $\nu_A$ is replaced by $\nu_{A^0}$ as defined
by (3.3).

Since the relationship among well-calibrated forecasters defined by
the concept of one being at least as refined as another is both reflexive
and transitive, this relationship induces a partial ordering among those
forecasters. We do not obtain a total ordering, however, as the next
example shows.

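Suppose that $A$ and $B$ are well-calibrated forecasters characterized
by the following probability functions:

$$\nu_A(x) = \begin{cases} .1 & \text{for } x = 0, \\ .8 & \text{for } x = .5, \\ .1 & \text{for } x = 1, \end{cases} \tag{3.10}$$

and

$$\nu_B(x) = \begin{cases} .5 & \text{for } x = .1, \\ .5 & \text{for } x = .9. \end{cases} \tag{3.11}$$

Here $\mu = .5$.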


In this example, $A$ is not at least as refined as $B$. To see this,
suppose on the contrary that there were a stochastic transformation $h(x|y)$
that satisfied (3.5) and (3.6), and let

$$a = h(.1|0), \quad b = h(.1|.5), \quad \text{and} \quad c = h(.1|1). \tag{3.12}$$

Then for $x = .1$, the relations (3.5) and (3.6) become

$$(.1)a + (.8)b + (.1)c = .5, \qquad (.4)b + (.1)c = .05 . \tag{3.13}$$

Subtracting twice the second equation in (3.13) from the first gives
$(.1)a - (.1)c = .4$, so that $a - c = 4$, which is an impossibility
since both $0 \leq a \leq 1$ and $0 \leq c \leq 1$.

On the other hand, neither is $B$ at least as refined as $A$. To see
this, we need only note that on 20 percent of the days, $A$ makes predictions
of rain or no rain that are certain to be correct (because $A$ is
well-calibrated), whereas $B$ never makes correct predictions with certainty.
Thus, in this example neither $A$ nor $B$ is at least as refined as the other.
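
As a numerical illustration of the construction (3.9), the sketch below builds
the transformation from the perfect forecaster $A^0$ to the forecaster $B$ of
(3.11) and confirms that it reproduces $\nu_B$. The function name
`transformation_from_perfect` and the dictionary layout are assumptions made
for illustration only.

```python
def transformation_from_perfect(nu_b, mu):
    """Return (h(.|0), h(.|1)) as defined in (3.9).

    nu_b : dict mapping x -> nu_B(x) for a well-calibrated forecaster B
    mu   : climatological probability, assumed to satisfy 0 < mu < 1
    """
    h_rain = {x: x * p / mu for x, p in nu_b.items()}                  # h(x|1)
    h_dry = {x: (1.0 - x) * p / (1.0 - mu) for x, p in nu_b.items()}   # h(x|0)
    return h_dry, h_rain


nu_b = {0.1: 0.5, 0.9: 0.5}                  # forecaster B of (3.11)
mu = sum(x * p for x, p in nu_b.items())     # relation (3.1)
h_dry, h_rain = transformation_from_perfect(nu_b, mu)

# Left-hand side of (3.5) with nu_A replaced by nu_{A^0} from (3.3):
# A^0 predicts 0 with probability 1 - mu and 1 with probability mu.
simulated = {x: h_dry[x] * (1.0 - mu) + h_rain[x] * mu for x in nu_b}
```

With these inputs, `simulated` agrees with `nu_b` up to rounding, and each of
`h_dry` and `h_rain` sums to one, as (3.4) requires.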

4.  Sufficiency 

In Section 2 we characterized the predictive behavior of any forecaster,
regardless of whether or not he was well-calibrated, by the functions
$\nu(x)$ and $\rho(x)$. In effect, we represented the joint distribution of
the prediction $x$ and the indicator of rain in terms of the marginal
distribution of $x$ and the conditional probability $\rho(x)$ of rain given
the prediction $x$. But it is also useful at times to use an alternative
factorization of this joint distribution (see, e.g., Lindley, Tversky, and
Brown, 1979, and Lindley, 1981).

Let $\theta$ denote the indicator of rain, so $\theta = 1$ if rain occurs on
a particular day and $\theta = 0$ otherwise, and for any given forecaster let
$f(x|\theta)$ denote the conditional probability function of the forecaster's
predictions given $\theta$. In other words, for $\theta = 1$, $f(x|\theta)$
represents the frequency function of the forecaster's predictions on days when
rain actually occurs. It follows that for $x \in \mathcal{X}$,

$$\mu f(x|1) = \rho(x)\,\nu(x), \tag{4.1}$$

$$(1-\mu) f(x|0) = [1 - \rho(x)]\,\nu(x). \tag{4.2}$$

It follows from (4.1) and (4.2) that the probability functions $f(x|\theta)$
for $\theta = 0$ and $\theta = 1$ characterize the forecaster's predictive
behavior.
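
The passage between the two factorizations is a direct application of (4.1)
and (4.2). The sketch below is illustrative only: the name `likelihoods`, the
dictionary inputs and the sample values of $\nu$ and $\rho$ are assumptions,
and $\mu$ is obtained by summing (4.1) over $x$.

```python
def likelihoods(nu, rho):
    """Return (mu, f(.|0), f(.|1)) from nu(x) and rho(x) via (4.1)-(4.2).

    nu  : dict mapping x -> nu(x), the frequency of prediction x
    rho : dict mapping x -> rho(x), the frequency of rain given prediction x
    """
    # Summing (4.1) over x gives mu, because f(.|1) sums to one.
    mu = sum(rho[x] * nu[x] for x in nu)
    f_rain = {x: rho[x] * nu[x] / mu for x in nu}                   # f(x|1)
    f_dry = {x: (1.0 - rho[x]) * nu[x] / (1.0 - mu) for x in nu}    # f(x|0)
    return mu, f_dry, f_rain


# Hypothetical forecaster who is not well-calibrated: rho(x) differs from x.
nu = {0.2: 0.5, 0.8: 0.5}
rho = {0.2: 0.3, 0.8: 0.7}
mu, f_dry, f_rain = likelihoods(nu, rho)
```

Conversely, $\nu(x) = \mu f(x|1) + (1-\mu) f(x|0)$ recovers the marginal
distribution of the predictions.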

Now consider two forecasters $A$ and $B$ characterized by the functions
$f_A(x|\theta)$ and $f_B(x|\theta)$. Following the original work of Blackwell
(1951, 1953) on the comparison of experiments, we say that forecaster $A$ is
sufficient for forecaster $B$ if there exists a stochastic transformation
$h(x|y)$ such that

$$\sum_{y \in \mathcal{X}} h(x|y)\,f_A(y|\theta) = f_B(x|\theta)
\ \text{ for } x \in \mathcal{X} \text{ and } \theta = 0, 1 \tag{4.3}$$

(see, e.g., DeGroot, 1970, Sec. 14.17). The interpretation of (4.3) is
similar to that given in Section 3: forecaster $A$ is sufficient for
forecaster $B$ if we can simulate the predictions of $B$ from the predictions
of $A$ by using an auxiliary randomization based on the stochastic
transformation $h$.

As  before,  the  relationship  of  sufficiency  induces  a  partial  ordering 
among  all  forecasters.  Since  we  have  applied  this  relationship  to  all 
forecasters,  however,  and  not  just  to  those  who  are  well-calibrated,  it  is 
not  necessarily  true  that  if  A  is  sufficient  for  B  then  A  is  at  least 
as  "good"  a  forecaster  as  B  .  For  example,  suppose  that  A  never  makes 
a prediction other than $x = 0$ or $x = 1$, but that he is always wrong about
whether  or  not  it  is  going  to  rain.  Then  A  is  sufficient  for  every  other 
forecaster,  even  though  he  is  the  worst  possible  forecaster.  Of 
course,  if  we  knew  that  A  was  always  wrong,  his  predictions  would  be  just 
as  useful  to  us  as  those  of  a  forecaster  who  was  always  correct. 

Theorem 1.  Consider two forecasters $A$ and $B$ whose predictive behavior
is characterized by the functions $\nu_A(x)$, $\rho_A(x)$, $\nu_B(x)$, and
$\rho_B(x)$. Then forecaster $A$ is sufficient for forecaster $B$ if and only
if there exists a stochastic transformation $h$ such that the following
relations are satisfied:

$$\sum_{y \in \mathcal{X}} h(x|y)\,\nu_A(y) = \nu_B(x)
\ \text{ for } x \in \mathcal{X}, \tag{4.4}$$

$$\sum_{y \in \mathcal{X}} h(x|y)\,\rho_A(y)\,\nu_A(y) = \rho_B(x)\,\nu_B(x)
\ \text{ for } x \in \mathcal{X}. \tag{4.5}$$

Proof.  Consider any fixed value $x \in \mathcal{X}$. It follows from (4.1)
that for $\theta = 1$, the relation (4.3) is the same as (4.5). Furthermore,
it follows from (4.2) that for $\theta = 0$, the relation (4.3) is the same as
the relation

$$\sum_{y \in \mathcal{X}} h(x|y)\,[1 - \rho_A(y)]\,\nu_A(y)
= [1 - \rho_B(x)]\,\nu_B(x), \tag{4.6}$$

which, in view of (4.5), is equivalent to (4.4). ∎

Recall now that a forecaster is well-calibrated if $\rho(x) = x$ for
all $x \in \mathcal{X}^+$. The following result follows immediately in the
light of relations (3.5) and (3.6).

Theorem 2.  Consider two well-calibrated forecasters $A$ and $B$.
Then forecaster $A$ is at least as refined as forecaster $B$ if and only
if forecaster $A$ is sufficient for forecaster $B$.

5.  Conditions  for  sufficiency 

In  this  section  we  shall  again  consider  two  well-calibrated  forecasters 
A  and  B  .  In  order  to  determine  whether  or  not  A  is  sufficient  for  B 
based  on  the  discussion  in  the  previous  sections,  it  is  necessary  to  determine 
whether  or  not  there  exists  a  stochastic  transformation  that  satisfies  either 
the  relations  (3.5)  and  (3.6)  or  the  relations  (4.3).  Attempts  to  establish 
the  existence  or  non-existence  of  such  a  stochastic  transformation  can  be 
frustrating  and  fruitless.  Fortunately,  Blackwell  and  Girshick  (1954)  and 
Bradt  and  Karlin  (1956)  have  provided  some  direct  methods  for  determining 
whether  or  not  A  is  sufficient  for  B  that  eliminate  the  necessity  of 
having to consider stochastic transformations.

For any forecaster $A$, let

$$\alpha_A(x) = f_A(x|1) + f_A(x|0) \ \text{ for } x \in \mathcal{X}, \tag{5.1}$$

and for $0 \leq t \leq 1$, let $\mathcal{S}_A(t)$ denote the subset of points
in $\mathcal{X}$ such that $f_A(x|1) < t\,\alpha_A(x)$. Furthermore, let

$$F_A(t) = \sum_{x \in \mathcal{S}_A(t)} \alpha_A(x)
\ \text{ for } 0 \leq t \leq 1 \tag{5.2}$$

and

$$C_A(t) = \int_0^t F_A(u)\,du \ \text{ for } 0 \leq t \leq 1 . \tag{5.3}$$

Then, as demonstrated in Theorem 12.4.1 of Blackwell and Girshick (1954),
forecaster $A$ is sufficient for forecaster $B$ if and only if
$C_A(t) \geq C_B(t)$ for all $t$ in the interval $0 \leq t \leq 1$.

A brief heuristic interpretation of this result is as follows: Suppose
that the parameter $\theta$ has prior probabilities given by
$\Pr(\theta = 1) = \Pr(\theta = 0) = \tfrac{1}{2}$. Then
$\tfrac{1}{2}\alpha_A(x)$ is the marginal distribution of $x$ for forecaster
$A$. Furthermore, if we let $\pi_A(x)$ denote the posterior probability
$\Pr(\theta = 1 | x)$ for forecaster $A$, then $\mathcal{S}_A(t)$ denotes the
set of values of $x$ for which $\pi_A(x) < t$. It can now be seen from (5.2)
that $\tfrac{1}{2}F_A(t)$ is the distribution function of the posterior
probability $\pi_A(x)$ for forecaster $A$. For an informative forecaster, the
values of $\pi_A(x)$ will tend to be concentrated near 0 and 1, and away from
their mean value $E_A[\pi_A(x)] = \tfrac{1}{2}$. The condition that
$C_A(t) \geq C_B(t)$ for all $t$ in the interval $0 \leq t \leq 1$ is
equivalent to the condition that
$E_A\{\varphi[\pi_A(x)]\} \geq E_B\{\varphi[\pi_B(x)]\}$ for every
continuous convex function $\varphi$. In this sense, the condition expresses
the notion that the probability distribution of $\pi_A(x)$ is more spread out
from $\tfrac{1}{2}$ than the probability distribution of $\pi_B(x)$.

We are now ready to establish the main result of this section. Recall
that the set $\mathcal{X}$ comprises the points $x_0 < x_1 < \cdots < x_k$.

Theorem 3.  Consider two well-calibrated forecasters $A$ and $B$.
Then forecaster $A$ is sufficient for forecaster $B$ if and only if the
following inequalities are satisfied:

$$\sum_{i=0}^{j-1} (x_j - x_i)\,[\nu_A(x_i) - \nu_B(x_i)] \geq 0
\ \text{ for } j = 1, \ldots, k-1 . \tag{5.4}$$
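
Condition (5.4) involves only the two probability functions, so it is simple
to evaluate. The sketch below is an illustration under stated assumptions: the
names `sufficiency_margins` and `is_sufficient` are hypothetical, both
forecasters are taken to be well-calibrated with the same $\mu$, and the grid
`xs` is sorted with $x_0 = 0$ and $x_k = 1$. It is applied to the forecasters
of (3.10) and (3.11).

```python
def sufficiency_margins(xs, nu_a, nu_b):
    """Left-hand sides of (5.4) for j = 1, ..., k-1.

    xs   : sorted list x_0 < x_1 < ... < x_k of permissible predictions
    nu_a : dict mapping x -> nu_A(x); missing entries count as 0
    nu_b : dict mapping x -> nu_B(x); missing entries count as 0
    """
    margins = []
    for j in range(1, len(xs) - 1):
        total = sum(
            (xs[j] - xs[i]) * (nu_a.get(xs[i], 0.0) - nu_b.get(xs[i], 0.0))
            for i in range(j)
        )
        margins.append(total)
    return margins


def is_sufficient(xs, nu_a, nu_b, tol=1e-12):
    """True if every inequality in (5.4) holds, up to rounding tolerance."""
    return all(m >= -tol for m in sufficiency_margins(xs, nu_a, nu_b))


xs = [0.0, 0.1, 0.5, 0.9, 1.0]
nu_a = {0.0: 0.1, 0.5: 0.8, 1.0: 0.1}   # forecaster A of (3.10)
nu_b = {0.1: 0.5, 0.9: 0.5}             # forecaster B of (3.11)
nu_const = {0.5: 1.0}                   # always predicts mu = 0.5

a_for_b = is_sufficient(xs, nu_a, nu_b)
b_for_a = is_sufficient(xs, nu_b, nu_a)
a_for_const = is_sufficient(xs, nu_a, nu_const)
```

For these inputs the check fails in both directions between $A$ and $B$,
which agrees with the example of Section 3, while $A$ passes against the
constant forecaster.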

Proof.  Since both $A$ and $B$ are well-calibrated, it follows from
(4.1) and (4.2) that

$$\frac{f_A(x|1)}{\alpha_A(x)} = \frac{f_B(x|1)}{\alpha_B(x)}
= \frac{x/\mu}{(x/\mu) + [(1-x)/(1-\mu)]} \tag{5.5}$$

whenever $\alpha_A(x)$ and $\alpha_B(x)$ are non-zero. Even if either
$\alpha_A(x)$ or $\alpha_B(x)$ is zero for some $x \in \mathcal{X}$, without
loss of generality we still may define (5.5) to be satisfied. Next, for
$0 \leq t \leq 1$ and $x \in \mathcal{X}$, define

$$s(t,x) = t\left[\frac{x}{\mu} + \frac{1-x}{1-\mu}\right] - \frac{x}{\mu} . \tag{5.6}$$

Then both the sets $\mathcal{S}_A(t)$ and $\mathcal{S}_B(t)$ contain precisely
those points $x \in \mathcal{X}$ for which $s(t,x) > 0$. Since the sets
$\mathcal{S}_A(t)$ and $\mathcal{S}_B(t)$ are identical, we shall denote this
common set simply by $\mathcal{S}(t)$.

From (5.2) and (5.3) we can write

$$C_A(t) = \int_0^t \Big[ \sum_{x \in \mathcal{S}(u)} \alpha_A(x) \Big]\,du .$$

For each $x \in \mathcal{S}(t)$, $\alpha_A(x)$ contributes to the integral
over a certain set of $u$-values of length $t - f_A(x|1)/\alpha_A(x)$. Thus we
can re-express $C_A(t)$ as

$$C_A(t) = \sum_{x \in \mathcal{S}(t)} [\,t\,\alpha_A(x) - f_A(x|1)\,] . \tag{5.7}$$

Next, using (4.1) and (4.2) and the fact that $A$ is well-calibrated, we can
rewrite (5.7) as follows:

$$C_A(t) = \sum_{x \in \mathcal{S}(t)} s(t,x)\,\nu_A(x) . \tag{5.8}$$

Furthermore, if we rewrite $s(t,x)$, as given by (5.6), in the form

$$s(t,x) = \frac{1}{\mu(1-\mu)} \big\{ t\mu - [\,t\mu + (1-t)(1-\mu)\,]\,x \big\}, \tag{5.9}$$

then it can be seen that $\mathcal{S}(t)$ contains precisely those points
$x \in \mathcal{X}$ for which the quantity inside braces in (5.9) is positive.
Thus, $C_A(t)$ can be expressed as follows:

$$C_A(t) = \frac{1}{\mu(1-\mu)} \sum_{x \in \mathcal{X}}
\big\{ t\mu - [\,t\mu + (1-t)(1-\mu)\,]\,x \big\}^{+}\,\nu_A(x), \tag{5.10}$$

where, as usual, the notation $(m)^{+}$ denotes the positive part of the
quantity $m$.

For forecaster $B$, the function $C_B(t)$ is also given by (5.10) with
$\nu_A(x)$ replaced by $\nu_B(x)$. Let

$$L_A(t) = \mu(1-\mu)\,C_A(t) \tag{5.11}$$

and let $L_B(t)$ be defined similarly. Then it follows from Theorem 12.4.1 of
Blackwell and Girshick (1954), as cited earlier, that forecaster $A$ is
sufficient for forecaster $B$ if and only if $L_A(t) \geq L_B(t)$ for all $t$
in the interval $0 \leq t \leq 1$.

Corresponding to the points $0 = x_0 < x_1 < \cdots < x_k = 1$ in
$\mathcal{X}$, let the points $0 \leq t_0 < t_1 < \cdots < t_k \leq 1$ be
defined by the relations

$$t_j\mu - [\,t_j\mu + (1-t_j)(1-\mu)\,]\,x_j = 0
\ \text{ for } j = 0, 1, \ldots, k . \tag{5.12}$$

Then both $L_A(t)$ and $L_B(t)$ are continuous, piecewise linear functions over
the interval $0 \leq t \leq 1$ with $L_A(0) = L_B(0) = 0$ and
$L_A(1) = L_B(1) = \mu(1-\mu)$, and with vertices at the points
$t_0, t_1, \ldots, t_k$. Furthermore, $L_A(t_0) = L_B(t_0) = 0$ and, for
$j = 1, \ldots, k$,

$$L_A(t_j) = [\,t_j\mu + (1-t_j)(1-\mu)\,]
\sum_{i=0}^{j-1} (x_j - x_i)\,\nu_A(x_i), \tag{5.13}$$

with an analogous expression for forecaster $B$.

Finally, we note that when $j = k$, the right-hand side of (5.13) can
be reduced as follows by using (3.1):

$$\sum_{i=0}^{k-1} (x_k - x_i)\,\nu_A(x_i)
= x_k[1 - \nu_A(x_k)] - [\mu - x_k\,\nu_A(x_k)] = x_k - \mu . \tag{5.14}$$

Thus, $L_A(t_k) = L_B(t_k)$. Because both functions are continuous and
piecewise linear with the same vertices, it is enough to compare them at
those vertices. We have now established that forecaster $A$ is sufficient for
forecaster $B$ if and only if $L_A(t_j) \geq L_B(t_j)$ for
$j = 1, \ldots, k-1$. It follows from (5.13) that these $k-1$ inequalities are
equivalent to the $k-1$ inequalities (5.4). ∎

We make use of this theorem in the next section.

6.  The least-refined well-calibrated forecaster

In Section 3, we required that $\mu \in \mathcal{X}$ in order to ensure that
the least-refined forecaster who is well-calibrated will always announce $\mu$
as his forecast. Suppose now that $\mu \notin \mathcal{X}$, and let $x_L$ and
$x_U$ be the pair of adjacent values in $\mathcal{X}$ just bracketing $\mu$,
i.e., $x_L < \mu < x_U$. Then let $A_0'$ be a well-calibrated forecaster
characterized by the probability function:

$$\nu_{A_0'}(x_U) = \frac{\mu - x_L}{x_U - x_L}, \qquad
\nu_{A_0'}(x_L) = \frac{x_U - \mu}{x_U - x_L}, \qquad
\nu_{A_0'}(x) = 0 \ \text{ for } x \neq x_L \text{ or } x_U . \tag{6.1}$$



It can be seen from (6.1) that $A_0'$ concentrates his forecasts as
closely as possible to $\mu$ given the permissible forecast values. Thus,
there is an intuitive sense in which $A_0'$ is not as refined as another
forecaster who spreads out his probabilities over at least some of the
other values of $x$. We make this notion precise in the following
theorem.

Theorem 4.  The well-calibrated forecaster $A_0'$, whose probability
function $\nu_{A_0'}(x)$ is given by (6.1), is least refined among all other
well-calibrated forecasters.

Proof.  Consider any other well-calibrated forecaster $A$. Then, from
Theorem 2, $A$ is at least as refined as $A_0'$ if and only if $A$ is
sufficient for $A_0'$, and, from Theorem 3, this is true if and only if

$$\sum_{i=0}^{j-1} (x_j - x_i)\,[\nu_A(x_i) - \nu_{A_0'}(x_i)] \geq 0
\ \text{ for } j = 1, 2, \ldots, k-1 . \tag{6.2}$$

To verify (6.2), we note that for $j = 1, \ldots, L$,

$$\sum_{i=0}^{j-1} (x_j - x_i)\,[\nu_A(x_i) - \nu_{A_0'}(x_i)]
= \sum_{i=0}^{j-1} (x_j - x_i)\,\nu_A(x_i), \tag{6.3}$$

which clearly is nonnegative since $x_j > x_i$ and $\nu_A(x)$ is a probability
function. For $j = U$, recalling that $U = L + 1$, we have

$$\sum_{i=0}^{L} (x_U - x_i)\,[\nu_A(x_i) - \nu_{A_0'}(x_i)]
= \sum_{i=0}^{L} (x_U - x_i)\,\nu_A(x_i) - (x_U - x_L)\,\nu_{A_0'}(x_L)
= \sum_{i=0}^{L} (x_U - x_i)\,\nu_A(x_i) - (x_U - \mu), \tag{6.4}$$

where the final expression follows from (6.1). Since $A$ is well-calibrated,
we can now use (3.1) to rewrite the expression following the final equality
sign in (6.4) as follows:

$$\sum_{i=0}^{L} (x_U - x_i)\,\nu_A(x_i) - \sum_{i=0}^{k} (x_U - x_i)\,\nu_A(x_i)
= \sum_{i=U}^{k} (x_i - x_U)\,\nu_A(x_i) \geq 0 . \tag{6.5}$$

Similarly, for $j = U+1, \ldots, k$, the left-hand side of expression (6.2)
equals

$$\sum_{i=j}^{k} (x_i - x_j)\,\nu_A(x_i) \geq 0 .$$ ∎

The use of Theorem 3 is critical to the preceding proof, for otherwise
we would need to construct the actual stochastic transformation $h(x|y)$
going from $A$ to $A_0'$, simultaneously ensuring that the calibration
condition holds. We have found this to be a nontrivial task.
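
The construction (6.1) and the inequalities (6.2) can be illustrated
numerically. In the sketch below the function names, the grid and the
comparison forecaster are hypothetical choices made only for illustration;
the comparison forecaster is assumed to be well-calibrated with the same
$\mu$.

```python
def least_refined(xs, mu):
    """Probability function of A_0' in (6.1).

    xs : sorted list of permissible predictions
    mu : climatological probability, assumed to lie strictly between two
         adjacent grid points (the case mu not in X)
    """
    for x_lo, x_hi in zip(xs, xs[1:]):
        if x_lo < mu < x_hi:
            width = x_hi - x_lo
            return {x_lo: (x_hi - mu) / width, x_hi: (mu - x_lo) / width}
    raise ValueError("mu must lie strictly between two adjacent grid points")


def margins_62(xs, nu_a, nu_0):
    """Left-hand sides of (6.2) for j = 1, ..., k-1."""
    return [
        sum((xs[j] - xs[i]) * (nu_a.get(xs[i], 0.0) - nu_0.get(xs[i], 0.0))
            for i in range(j))
        for j in range(1, len(xs) - 1)
    ]


xs = [0.0, 0.25, 0.5, 0.75, 1.0]
mu = 0.6
nu_0 = least_refined(xs, mu)        # mass on x_L = 0.5 and x_U = 0.75

# A hypothetical well-calibrated forecaster with the same mean 0.6.
nu_a = {0.0: 0.1, 0.25: 0.2, 0.5: 0.2, 0.75: 0.2, 1.0: 0.3}
margins = margins_62(xs, nu_a, nu_0)
```

Here `nu_0` places probability 0.6 on 0.5 and 0.4 on 0.75, and every entry of
`margins` is nonnegative, as Theorem 4 asserts.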

7.  Scoring  rules  for  assessment 

In the television station example introduced in Section 1, we get to see
a finite set of forecasts and the associated indicators of whether or not rain
occurred, i.e., $\{(p_j, y_j): j = 1, 2, \ldots, n\}$. Several authors have
suggested scoring rules to be used to assess probability assessors in such
situations. Here we relate some of these to the probabilistic concepts of
calibration and refinement.

One of the earliest scoring rule proposals suggested in the context of
meteorological forecasts is the "Brier Score",

$$BS_n = \frac{1}{n} \sum_{j=1}^{n} (p_j - y_j)^2 , \tag{7.1}$$

which the forecaster is to attempt to minimize (Brier, 1950). In the case
of binary outcomes (rain, no rain), Winkler (1967) notes that the Brier Score
is equivalent to the general quadratic scoring rule proposed by de Finetti
(1965), designed to oblige the forecaster "to express his true feelings"
(de Finetti, 1962).



Other  general  classes  of  "strictly  proper"  scoring  rules  include 
Good's  (1952)  logarithmic  scoring  rule  and  the  spherical  scoring  rule  (see 
Stael  von  Holstein  1970,  and  Savage  1971). 

If we let $n_i$ equal the number of days out of $n$ on which the forecaster
predicts rain with probability $x_i$, and $r_i$ the number of these $n_i$ days
on which it actually does rain, we can rewrite the Brier Score of (7.1) as

$$BS_n = \frac{1}{n} \sum_{i=0}^{k} n_i \Big( x_i - \frac{r_i}{n_i} \Big)^2
+ \frac{1}{n} \sum_{i=0}^{k} n_i\,\frac{r_i}{n_i} \Big( 1 - \frac{r_i}{n_i} \Big) \tag{7.2}$$

or as

$$BS_n = \frac{1}{n} \sum_{i=0}^{k} n_i \Big( x_i - \frac{r_i}{n_i} \Big)^2
+ \frac{r}{n} \Big( 1 - \frac{r}{n} \Big)
- \frac{1}{n} \sum_{i=0}^{k} n_i \Big( \frac{r_i}{n_i} - \frac{r}{n} \Big)^2 , \tag{7.3}$$

where $\sum_{i=0}^{k} r_i = r$. Tukey, Mosteller, and Fienberg (1965) suggest a
variant of (7.2) which essentially allows the two components on the right-hand
side to be given different weights.
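
The equivalence of (7.1), (7.2) and (7.3) can be confirmed on any finite
record of forecasts. The sketch below uses a small made-up record purely for
illustration; the function names and the labels `calibration`, `refinement`,
`uncertainty` and `resolution` for the individual terms are naming
assumptions, not terminology taken from the text.

```python
from collections import defaultdict


def brier_score(pairs):
    """Brier Score (7.1) for a list of (p_j, y_j) pairs, y_j in {0, 1}."""
    return sum((p - y) ** 2 for p, y in pairs) / len(pairs)


def brier_components(pairs):
    """Terms of (7.2) and (7.3), grouped by distinct forecast value x_i."""
    n = len(pairs)
    n_i = defaultdict(int)   # number of days with forecast x_i
    r_i = defaultdict(int)   # number of those days on which it rained
    for p, y in pairs:
        n_i[p] += 1
        r_i[p] += y
    freq = {x: r_i[x] / n_i[x] for x in n_i}   # r_i / n_i
    r_bar = sum(r_i.values()) / n              # r / n
    calibration = sum(n_i[x] * (x - freq[x]) ** 2 for x in n_i) / n
    refinement = sum(n_i[x] * freq[x] * (1 - freq[x]) for x in n_i) / n
    uncertainty = r_bar * (1 - r_bar)
    resolution = sum(n_i[x] * (freq[x] - r_bar) ** 2 for x in n_i) / n
    return calibration, refinement, uncertainty, resolution


# Hypothetical record of eight forecasts and outcomes.
pairs = [(0.1, 0), (0.1, 0), (0.1, 1), (0.5, 1),
         (0.5, 0), (0.9, 1), (0.9, 1), (0.9, 0)]
bs = brier_score(pairs)
cal, ref, unc, res = brier_components(pairs)
```

Up to rounding, `bs` equals `cal + ref`, which is (7.2), and `ref` equals
`unc - res`, which turns (7.2) into (7.3).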

To  understand  how  the  components  of  the  Brier  Score  relate  to  the  con¬ 
cepts  discussed  here  we  let  n-*»  in  such  a  manner  that  r^n^  *  p(*^)  and 


n^/n  ■*  v(Xj)  .  Then 
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and 


lim 

n-H» 


n 


p(x  )v(x±) 


lim 

n-H» 


2 


P 


(xi)v(xi)  . 


(7. A) 


(7.5) 


Any  sampling  scheme  of  trials  with  these  limiting  properties  suffices  for 
our  purposes.  Thus,  from  (7.2)  we  have 

$$BS = \lim_{n \to \infty} BS_n = \sum_{i=0}^{k} v(x_i)\,[x_i - p(x_i)]^2 + \sum_{i=0}^{k} v(x_i)\,p(x_i)\,[1 - p(x_i)]. \tag{7.6}$$


The first term on the right-hand side of (7.6) is the weighted mean
square difference between the forecasted probability and the frequency
of rain $p(x_i)$. As such it is a measure of calibration. If the forecaster
is well-calibrated, this term equals zero.

The second term on the right-hand side of (7.6) measures the dispersion
of the results of the forecaster's predictions. As such it rewards the
forecaster for spreading his predictions as much as possible, and thus is a
measure of the forecaster's refinement. The following theorem shows that
there is a direct relationship between this term and the concepts of
refinement and sufficiency presented in Sections 3 and 4.

Theorem 5. If forecaster A is sufficient for forecaster B, then

$$\sum_x v_A(x)\,p_A(x)\,[1 - p_A(x)] \le \sum_x v_B(x)\,p_B(x)\,[1 - p_B(x)]. \tag{7.7}$$


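Proof: Since A is sufficient for B, from (4.5) we have

$$\begin{aligned}
\sum_x p_B(x)\,v_B(x) &= \sum_x \sum_y h(x|y)\,p_A(y)\,v_A(y) \\
&= \sum_y \Big[\sum_x h(x|y)\Big]\,p_A(y)\,v_A(y) \\
&= \sum_y p_A(y)\,v_A(y).
\end{aligned} \tag{7.8}$$

Next, by applying both (4.4) and (4.5) we have

$$\begin{aligned}
v_B(x)\,p_B^2(x) &= \frac{\big[\sum_y h(x|y)\,p_A(y)\,v_A(y)\big]^2}{\sum_y h(x|y)\,v_A(y)} \\
&= \Big[\sum_y h(x|y)\,v_A(y)\Big] \Big[\sum_y \frac{h(x|y)\,v_A(y)}{\sum_{y'} h(x|y')\,v_A(y')}\,p_A(y)\Big]^2 \\
&\le \Big[\sum_y h(x|y)\,v_A(y)\Big] \Big[\sum_y \frac{h(x|y)\,v_A(y)}{\sum_{y'} h(x|y')\,v_A(y')}\,p_A^2(y)\Big] \\
&= \sum_y h(x|y)\,v_A(y)\,p_A^2(y),
\end{aligned} \tag{7.9}$$

where the inequality is a special case of Jensen's inequality. Summing (7.9)
over x now yields

$$\begin{aligned}
\sum_x v_B(x)\,p_B^2(x) &\le \sum_x \sum_y h(x|y)\,v_A(y)\,p_A^2(y) \\
&= \sum_y \Big[\sum_x h(x|y)\Big]\,v_A(y)\,p_A^2(y) \\
&= \sum_y v_A(y)\,p_A^2(y).
\end{aligned} \tag{7.10}$$

Finally, combining (7.8) and (7.10) yields the inequality (7.7).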
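A small numerical illustration of Theorem 5 may be helpful. The Python sketch below is not from the original report. It assumes that relations (4.4) and (4.5) take the forms used in the proof, namely $v_B(x) = \sum_y h(x|y) v_A(y)$ and $p_B(x) v_B(x) = \sum_y h(x|y) p_A(y) v_A(y)$, and the forecaster A, the stochastic transformation `h`, and all function names are hypothetical.

```python
def transform(v_a, p_a, h):
    """Derive forecaster B from forecaster A via h.

    h[x][y] plays the role of h(x|y); each column y sums to 1 over x.
    Assumed forms of (4.4) and (4.5), as used in the proof:
        v_B(x)        = sum_y h(x|y) v_A(y)
        p_B(x) v_B(x) = sum_y h(x|y) p_A(y) v_A(y)
    """
    v_b, p_b = [], []
    for row in h:
        mass = sum(hxy * va for hxy, va in zip(row, v_a))
        rain = sum(hxy * pa * va for hxy, pa, va in zip(row, p_a, v_a))
        v_b.append(mass)
        p_b.append(rain / mass if mass > 0 else 0.0)
    return v_b, p_b


def second_term(v, p):
    """Second term of (7.6): sum over x of v(x) p(x) [1 - p(x)]."""
    return sum(vi * pi * (1 - pi) for vi, pi in zip(v, p))


# Hypothetical forecaster A with three forecast values.
v_a = [0.3, 0.4, 0.3]   # v_A(y): relative frequency of each forecast
p_a = [0.1, 0.5, 0.9]   # p_A(y): frequency of rain given each forecast

# Hypothetical stochastic transformation onto two forecast values.
h = [[0.8, 0.5, 0.2],
     [0.2, 0.5, 0.8]]

v_b, p_b = transform(v_a, p_a, h)
term_a = second_term(v_a, p_a)
term_b = second_term(v_b, p_b)
print(term_a, term_b)
```

In this example the overall frequency of rain is 0.5 for both forecasters, in agreement with (7.8), while the second term is 0.154 for A and about 0.229 for B, in agreement with (7.7).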
We recall from Section 5 that forecaster A is sufficient for forecaster
B if and only if $C_A(t) \ge C_B(t)$, where $C_A(t)$ is defined by (5.3), and
that this condition is equivalent to

$$E_A\{\varphi[\pi_A(x)]\} \ge E_B\{\varphi[\pi_B(x)]\} \tag{7.11}$$

for every continuous convex function $\varphi$. Theorem 5 is, in effect, a special
case of this equivalence. From (7.11) we can construct a class of generalized
limiting scoring rules that replace the second term of (7.6) by

$$\sum_{x \in \mathcal{X}} v(x)\,\varphi[p(x)]. \tag{7.12}$$
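As a rough illustration of the generalized term (7.12), the Python sketch below evaluates it for two hypothetical forecasters and two convex choices of $\varphi$. None of this is from the original report: forecaster B is assumed to be obtained from A by the degenerate stochastic transformation that maps every forecast of A to a single value, and the function names are illustrative only.

```python
import math


def generalized_term(v, p, phi):
    """Expression (7.12): sum over x of v(x) * phi(p(x))."""
    return sum(vi * phi(pi) for vi, pi in zip(v, p))


def square(t):
    """Convex function phi(t) = t^2, the case underlying Theorem 5."""
    return t * t


def negative_entropy(t):
    """Convex function phi(t) = t log t + (1 - t) log(1 - t), 0 < t < 1."""
    return t * math.log(t) + (1 - t) * math.log(1 - t)


# Hypothetical forecaster A.
v_a = [0.3, 0.4, 0.3]
p_a = [0.1, 0.5, 0.9]

# Hypothetical forecaster B: always announces the same forecast, so the
# frequency of rain given that forecast is the overall frequency 0.5.
v_b = [1.0]
p_b = [0.5]

for phi in (square, negative_entropy):
    value_a = generalized_term(v_a, p_a, phi)
    value_b = generalized_term(v_b, p_b, phi)
    print(phi.__name__, value_a, value_b, value_a >= value_b)
```

For both convex functions the value for A is at least the value for B, which is the ordering one would expect when A is sufficient for B.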

The  actual  assessment  of  probability  assessors  based  on  a  finite  set 
of  forecasts  requires  a  careful  description  of  the  stochastic  mechanisms 
associated  with  the  production  of  predictions  for  the  forecasters  being 
compared.  We  shall  present  such  a  description  in  a  separate  paper. 

8.  Multivariate  forecasts 

In the preceding sections we have considered events with $s = 2$ possible
outcomes (e.g., rain, no rain). Yet climatological forecasting often involves
$s > 2$ outcomes (e.g., rain, snow, and neither rain nor snow, or a set of
temperature ranges). In such situations the probability assessor specifies
a vector of probabilities $\mathbf{x}$, restricted to a finite set of values
$\mathcal{X}$ lying in the $(s-1)$-dimensional simplex. If the conditional
probabilities of the $s$ outcomes given the prediction $\mathbf{x}$ are
represented in vector form by $\mathbf{p}(\mathbf{x})$, then the multivariate
forecaster is well-calibrated if $\mathbf{p}(\mathbf{x}) = \mathbf{x}$ for
all $\mathbf{x} \in \mathcal{X}$. Note that this well-calibrated multivariate
forecaster is also well-calibrated, in the sense of Section 2, for each
binary problem formed by combining the $s$ outcomes into two groups; however,
a forecaster who is "marginally" well-calibrated for predicting "rain" or
"no rain" may no longer be well-calibrated when "no rain" is divided into
two or more possible outcomes.

More formally, let $\mathbf{x} = (x_1, \ldots, x_s)$ and
$\mathbf{p}(\mathbf{x}) = [p_1(\mathbf{x}), \ldots, p_s(\mathbf{x})]$.
Furthermore, let $J = \{I_1, \ldots, I_k\}$ represent a partition of the set
$\{1, \ldots, s\}$ into $k$ nonempty, mutually exclusive, and exhaustive sets
$I_1, \ldots, I_k$. Then a forecaster is said to be marginally well-calibrated
with respect to the partition $J$ if

$$\sum_{i \in I_j} p_i(\mathbf{x}) = \sum_{i \in I_j} x_i \quad \text{for } j = 1, \ldots, k \text{ and } \mathbf{x} \in \mathcal{X}. \tag{8.1}$$
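Relation (8.1) is straightforward to check mechanically. The Python sketch below is illustrative only and not from the original report: the table of forecast vectors, the tolerance, and the function name are hypothetical, and outcome indices are 0-based in the code rather than 1-based as in the text.

```python
def is_marginally_calibrated(table, partition, tol=1e-9):
    """Check relation (8.1) for every forecast vector in the table.

    table maps each forecast vector x (a tuple of s probabilities) to
    p(x), the conditional probabilities of the s outcomes given x.
    partition is a list of index lists I_1, ..., I_k (0-based here).
    """
    for x, p in table.items():
        for block in partition:
            p_sum = sum(p[i] for i in block)
            x_sum = sum(x[i] for i in block)
            if abs(p_sum - x_sum) > tol:
                return False
    return True


# Hypothetical forecaster with s = 3 outcomes: (rain, snow, neither).
table = {
    (0.2, 0.3, 0.5): (0.2, 0.4, 0.4),
    (0.6, 0.1, 0.3): (0.6, 0.2, 0.2),
}

rain_vs_no_rain = [[0], [1, 2]]    # "no rain" combines snow and neither
all_outcomes = [[0], [1], [2]]     # the finest partition

print(is_marginally_calibrated(table, rain_vs_no_rain))
print(is_marginally_calibrated(table, all_outcomes))
```

Here the hypothetical forecaster satisfies (8.1) for the partition into "rain" and "no rain" but not for the finest partition, which mirrors the remark that marginal calibration can be lost when "no rain" is divided further.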

Similarly, we can develop the notion of conditionally well-calibrated
forecasters. Consider again the problem treated in Sections 2-7, in which
$s = 2$ and the forecaster simply specifies his probability $x$ of rain. The
forecaster may be well-calibrated for some, but not all, values of $x$. In
other words, it may be true that $p(x) = x$ when $x$ belongs to some subset
$\mathcal{X}_0$ of $\mathcal{X}$, but not for all values of $x \in \mathcal{X}$.
In this case, we may say that the forecaster is conditionally well-calibrated,
given that $x \in \mathcal{X}_0$.

Now consider the general multivariate forecasting problem introduced in
this section. Let the partition $J$ be as defined here, and let $\mathcal{X}_0$
denote a proper subset of $\mathcal{X}$. Then a forecaster is said to be
conditionally well-calibrated with respect to the partition $J$, given that
$\mathbf{x} \in \mathcal{X}_0$, if the relation (8.1) is satisfied for
$j = 1, \ldots, k$ and all $\mathbf{x} \in \mathcal{X}_0$.
For well-calibrated multivariate forecasters, we can define the concept of
refinement  by  means  of  a  multivariate  stochastic  transformation.  Moreover, 
this  notion  of  refinement  can  again  be  directly  linked  to  sufficiency  in  the 
comparison  of  experiments  with  a  finite  number  of  outcomes.  Finally,  the 
concept  of  one  forecaster  being  marginally  or  conditionally  more  refined 
than  another  can  be  developed. 

Critical  to  the  multivariate  versions  of  calibration  and  refinement  as 
proposed  in  this  section  is  the  orientation  of  the  vector  of  forecasted 
probabilities  x  .  Each  component  of  x  refers  to  a  specific  outcome. 

This  methodology  should  be  contrasted  with  the  multivariate  approach, 
described  for  example  by  Lichtenstein,  Fischhoff,  and  Phillips  (1977), 
in which the forecaster "selects the single most likely alternative and
states the probability that it is correct." Kadane and Lichtenstein (1981)
show  that  such  a  loss  of  orientation  leads  to  the  inability  to  recalibrate 
a  forecaster's  assessments.  From  the  discussion  here,  it  should  be  clear 
that  a  careful  description  of  calibration  and  refinement  in  both  the  binary 
and  multivariate  settings  requires  a  well-specified  set  of  outcomes,  and 
probability  assessments  specifically  tied  to  those  outcomes. 